Determination of the best knot and bandwidth in geographically weighted truncated spline nonparametric regression using generalized cross validation

This study develops a nonparametric regression for data containing spatial heterogeneity, with local parameter estimates for each observation location. Geographically Weighted Truncated Spline Nonparametric Regression (GWTSNR) combines Truncated Spline Nonparametric Regression (TSNR) and Geographically Weighted Regression (GWR). It is therefore necessary to determine the optimum knot points for the TSNR component and the best geographic weighting (bandwidth) for the GWR component; both are chosen using Generalized Cross Validation (GCV). The case study analyzes the morbidity rate in North Sumatra in 2020. Models are estimated with one, two, and three knot points and with geographic weighting based on the Gaussian, Bisquare, Tricube, and Exponential kernel functions. Based on the minimum GCV value, the best model for the 2020 North Sumatra morbidity rate data uses one knot point and the Bisquare kernel function. Based on the GWTSNR model, the districts/cities fall into eight groups according to their significant predictors. Furthermore, the GWTSNR models the 2020 morbidity rates in North Sumatra better than the TSNR, with an adjusted R-squared of 96.235% compared with 70.159%. Some of the highlights of the proposed approach are:
• The method combines nonparametric and spatial regression to model the morbidity rate.
• Three knot-point settings were tested in the truncated spline nonparametric regression and four geographic weightings in the spatial regression, and the best knot and bandwidth were determined using Generalized Cross Validation.
• The paper determines regional groupings in North Sumatra in 2020 based on the significant predictors of the morbidity rate.


best weighting using Generalized Cross Validation (GCV), a development of Cross Validation (CV). Moreover, we will also provide an algorithm for using this model in analyzing the case studies studied.

TSNR model
Nonparametric regression is used to model the relationship between a response variable and predictor variables when the form of the regression curve is unknown. It is a very flexible approach to modeling data patterns [22]. In general, a nonparametric regression model can be written as

$$ y_i = f(x_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n, $$

where $y_i$ is the response variable, $x_i$ is the predictor variable, $f(x_i)$ is an unknown regression curve that does not follow a particular pattern, and $\varepsilon_i \sim N(0, \sigma^2)$. If the regression curve is an additive model and is approximated by a spline function, the regression model becomes

$$ y_i = \sum_{k=0}^{m} \beta_k x_i^{k} + \sum_{h=1}^{r} \beta_{m+h} (x_i - K_h)_+^{m} + \varepsilon_i, $$

where $\beta_k$ and $\beta_{m+h}$ are real constants with $k = 0, 1, 2, \ldots, m$ and $h = 1, 2, \ldots, r$, and the truncated function is

$$ (x_i - K_h)_+^{m} = \begin{cases} (x_i - K_h)^{m}, & x_i \ge K_h \\ 0, & x_i < K_h. \end{cases} $$

Here $K_h$ is a knot point, which marks where the behaviour of the function changes between sub-intervals. The parameters of the TSNR model are estimated using the maximum likelihood method:

$$ \hat{\beta} = (X'X)^{-1} X' y, $$

where
$\hat{\beta}$: parameter estimates of the TSNR model
$X$: matrix of the predictor variable and its truncated terms
$y$: vector of the response variable
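The truncated basis and the estimator above can be illustrated with a minimal sketch. It assumes a single predictor, a linear spline (degree 1), and NumPy; the function names and the toy data are hypothetical and are not taken from the study.

```python
import numpy as np

def truncated_power(x, knot, degree=1):
    """Truncated power term (x - knot)_+^degree: zero to the left of the knot."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= knot, (x - knot) ** degree, 0.0)

def design_matrix(x, knots, degree=1):
    """Columns: 1, x, ..., x^degree, then one truncated term per knot."""
    x = np.asarray(x, dtype=float)
    poly = [x ** k for k in range(degree + 1)]
    trunc = [truncated_power(x, K, degree) for K in knots]
    return np.column_stack(poly + trunc)

def fit_tsnr(x, y, knots, degree=1):
    """Least-squares solution of X beta = y, i.e. (X'X)^{-1} X'y via a stable solver."""
    X = design_matrix(x, knots, degree)
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return beta

# Toy data (hypothetical): slope -2 before the knot at 3, slope +1 after it.
x = np.linspace(0.0, 6.0, 25)
y = 10.0 - 2.0 * x + 3.0 * truncated_power(x, 3.0)
beta_hat = fit_tsnr(x, y, knots=[3.0])
```

Because the toy response is noise-free, `beta_hat` recovers the intercept, the slope below the knot, and the change in slope at the knot.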

GWR model
GWR was introduced by Brunsdon, Fotheringham, and Charlton in 1996 as a development of multiple linear regression. The multiple linear regression model has constant parameters across observation locations, while GWR has local parameters at each observation location. In the GWR model, the relationship between the response variable $y_i$ and the predictor variables $x_{1i}, x_{2i}, \ldots, x_{li}$ at location $i$ is as follows:

$$ y_i = \beta_0(u_i, v_i) + \sum_{p=1}^{l} \beta_p(u_i, v_i)\, x_{pi} + \varepsilon_i, \quad i = 1, 2, \ldots, n. $$

The parameters of the GWR model are estimated using the maximum likelihood method:

$$ \hat{\beta}(u_i, v_i) = \left( X' W(u_i, v_i) X \right)^{-1} X' W(u_i, v_i)\, y, $$

where
$\hat{\beta}(u_i, v_i)$: parameter estimates of the GWR model
$X$: matrix of predictor variables
$y$: vector of the response variable
$W(u_i, v_i)$: matrix of geographic weights
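As a minimal sketch of the local estimator, the following solves the weighted normal equations at a single location. The function name, the five-site data set, and the weight vector are hypothetical; in practice the weights would come from a kernel function of the distances to the focal location.

```python
import numpy as np

def gwr_local_fit(X, y, weights):
    """Solve (X' W X) beta = X' W y for one location, with W = diag(weights)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    XtW = X.T * w  # equals X.T @ np.diag(w) without building the n x n matrix
    return np.linalg.solve(XtW @ X, XtW @ y)

# Hypothetical example: 5 locations, an intercept and one predictor.
X = np.column_stack([np.ones(5), [1.0, 2.0, 3.0, 4.0, 5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])         # exactly 1 + 2x
w_at_site = np.array([1.0, 0.8, 0.5, 0.2, 0.0])  # weights as seen from the first site
beta_local = gwr_local_fit(X, y, w_at_site)
```

Repeating the call with a different weight vector for each location yields one coefficient vector per location, which is what makes the parameters local.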

GWTSNR model
GWTSNR is a development of nonparametric regression for spatial data with local parameter estimators for each observation location. The method was developed by Sifriyani, Gunardi, S.H. Kartiko, and I.N. Budiantara (2017). GWTSNR is a nonparametric regression approach used to solve spatial analysis problems in which the regression curve is unknown. The GWTSNR model assumes normally distributed errors with zero mean and variance $\sigma^2$ at each location $(u_i, v_i)$. Mathematically, the relationship between the response variable $y_i$ and the predictor variables $(x_{1i}, x_{2i}, \ldots, x_{li})$ at the $i$-th location can be expressed as follows (Sifriyani et al., 2017):

$$ y_i = \beta_0(u_i, v_i) + \sum_{p=1}^{l} \sum_{k=1}^{m} \beta_{pk}(u_i, v_i)\, x_{pi}^{k} + \sum_{p=1}^{l} \sum_{h=1}^{r} \delta_{p,m+h}(u_i, v_i)\, (x_{pi} - K_{ph})_+^{m} + \varepsilon_i \qquad (7) $$

Eq. (7) is the GWTSNR model of degree $m$ with $n$ areas. The components are described as follows:
$y_i$ is the response variable at the $i$-th location, where $i = 1, 2, \ldots, n$;
$x_{pi}$ is the $p$-th predictor variable in the $i$-th area, with $p = 1, 2, \ldots, l$;
$K_{ph}$ is the $h$-th knot point of the $p$-th predictor variable component, with $h = 1, 2, \ldots, r$;
$\beta_{pk}(u_i, v_i)$ is a parameter of the polynomial component of the GWTSNR, namely the $k$-th parameter of the $p$-th predictor variable in the $i$-th area;
$\delta_{p,m+h}(u_i, v_i)$ is a parameter of the truncated component of the GWTSNR, namely the $(m+h)$-th parameter, at the $h$-th knot point of the $p$-th predictor variable in the $i$-th area.

GWTSNR estimation procedures
Next, the parameter estimates $\beta(u_i, v_i)$ and $\delta(u_i, v_i)$ of the GWTSNR model are determined using Maximum Likelihood Estimation (MLE). The steps are as follows:
a. Determine the probability density function of $y_i$.
b. Establish the likelihood function.
c. Establish the weighted likelihood function at the $i$-th location, where $w_{ij}$ is the weight given to the $j$-th location when estimating at the $i$-th location.
d. Calculate the natural logarithm (ln) of the weighted likelihood function.
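For orientation, steps a to d can be sketched as follows. This is a hedged outline rather than the authors' exact derivation: the row vector $q_j'$ (the $j$-th row of a design matrix $Q$ holding the polynomial and truncated terms), the stacked parameter vector $\theta(u_i, v_i)$ collecting the $\beta$ and $\delta$ parameters, and the weight symbol $w_{ij}$ are notational assumptions, and the weights are assumed to multiply each squared error in the exponent.

```latex
% a. density of one response under normal errors
f(y_j) = \frac{1}{\sqrt{2\pi\sigma^2}}
         \exp\!\left( -\frac{\bigl(y_j - q_j'\theta\bigr)^2}{2\sigma^2} \right)

% b. likelihood of n independent observations
L(\theta, \sigma^2) = (2\pi\sigma^2)^{-n/2}
         \exp\!\left( -\frac{1}{2\sigma^2} \sum_{j=1}^{n} \bigl(y_j - q_j'\theta\bigr)^2 \right)

% c. weighted likelihood at location i (assumed form)
L^{*}\bigl(\theta(u_i, v_i), \sigma^2\bigr) = (2\pi\sigma^2)^{-n/2}
         \exp\!\left( -\frac{1}{2\sigma^2} \sum_{j=1}^{n} w_{ij}
         \bigl(y_j - q_j'\theta(u_i, v_i)\bigr)^2 \right)

% d. logarithm of the weighted likelihood
\ln L^{*} = -\frac{n}{2}\ln(2\pi\sigma^2)
         - \frac{1}{2\sigma^2} \sum_{j=1}^{n} w_{ij}
         \bigl(y_j - q_j'\theta(u_i, v_i)\bigr)^2
```

Under these assumptions, setting the derivative of $\ln L^{*}$ with respect to $\theta(u_i, v_i)$ to zero gives $Q' W(u_i, v_i)\,(y - Q\theta) = 0$, so that $\hat{\theta}(u_i, v_i) = (Q' W(u_i, v_i) Q)^{-1} Q' W(u_i, v_i)\, y$, which is consistent with the matrix $(Q' W(u_i, v_i) Q)^{-1}$ used in the partial significance test.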

Optimum knot point determination
A knot point is a joint point at which the behavior pattern of the function or curve changes. The number of knot points also affects the complexity of the model through the number of parameters used, so a suitable method is needed to determine the optimal knot points. Optimal knot points can be obtained using Generalized Cross Validation (GCV), which is generally defined as follows [22, 23]:

$$ GCV(K) = \frac{MSE(K)}{\left[ n^{-1}\, \operatorname{trace}\bigl( I - A(K) \bigr) \right]^2}, \qquad MSE(K) = n^{-1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, $$

where
$I$: the identity matrix
$n$: the number of observations
$A(K)$: the matrix that maps $y$ to the fitted values $\hat{y}$ for the knot points $K$
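Knot selection by GCV can be sketched as a search over candidate knots. The sketch assumes a linear truncated spline with one knot and a single predictor; the candidate grid, the simulated data, and the function names are hypothetical.

```python
import numpy as np

def gcv_score(X, y):
    """GCV = MSE / (trace(I - A) / n)^2 with hat matrix A = X (X'X)^{-1} X'."""
    n = len(y)
    A = X @ np.linalg.pinv(X)  # projection (hat) matrix
    resid = y - A @ y
    mse = float(resid @ resid) / n
    return mse / ((n - np.trace(A)) / n) ** 2

def one_knot_design(x, knot):
    """Linear truncated spline with a single knot: columns 1, x, (x - knot)_+."""
    return np.column_stack([np.ones_like(x), x, np.clip(x - knot, 0.0, None)])

# Hypothetical data with a change of slope at x = 4 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 60)
y = 2.0 + 0.5 * x - 1.5 * np.clip(x - 4.0, 0.0, None) + rng.normal(0.0, 0.05, x.size)

candidates = [2.0, 4.0, 6.0, 8.0]
scores = {K: gcv_score(one_knot_design(x, K), y) for K in candidates}
best_knot = min(scores, key=scores.get)  # candidate with the smallest GCV
```

The same loop extends to two or three knots by enumerating combinations of candidates and adding one truncated column per knot.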

Optimum bandwidth and geographic weights matrics determination
The role of the weights in the GWR model is critical because the weighting value represents the position of each observation relative to the others. LeSage (2001) introduced several weighting methods using kernel functions, including the Gaussian, Exponential, Bisquare, and Tricube kernels [24].
i. Gaussian

$$ w_{ij} = \phi\!\left( \frac{d_{ij}}{\sigma h} \right), $$

where $\phi$ is the standard normal density function and $\sigma$ denotes the standard deviation of the distance vector $d_i$.
ii. Exponential, in which the weight decays exponentially with the distance $d_{ij}$ relative to $h$. Here $d_{ij}$ is the distance from the $i$-th location to the $j$-th location, and $h$ is the bandwidth, a smoothing parameter whose value is always positive.
iii. Bisquare

$$ w_{ij} = \begin{cases} \left( 1 - (d_{ij}/h)^2 \right)^2, & d_{ij} \le h \\ 0, & d_{ij} > h, \end{cases} $$

where $h$ is a known non-negative parameter called the bandwidth or smoothing parameter.
The optimum bandwidth can be determined using GCV, which is as follows:

$$ GCV(h) = \frac{n \sum_{i=1}^{n} \bigl( y_i - \hat{y}_i(h) \bigr)^2}{\bigl( n - \operatorname{tr}(S(h)) \bigr)^2}, $$

where $GCV(h)$ is the GCV value at bandwidth $h$ and $\operatorname{tr}(S(h))$ is the sum of the main diagonal elements of the $n \times n$ weight matrix $S(h)$. In this study, the optimum bandwidth is selected using GCV: it is the bandwidth with the smallest GCV value, which corresponds to the model with the smallest prediction error.
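A hedged sketch of bandwidth selection follows. The Bisquare and Tricube kernels use their standard compact-support forms; the Gaussian and Exponential forms shown are common textbook variants and are assumptions here, since several parameterizations exist. The matrix `S` is built row by row from the local fits, and the grid of sites, the simulated data, and the candidate bandwidths are hypothetical.

```python
import numpy as np

def bisquare(d, h):
    """(1 - (d/h)^2)^2 inside the bandwidth, 0 outside."""
    u = np.asarray(d, dtype=float) / h
    return np.where(u <= 1.0, (1.0 - u ** 2) ** 2, 0.0)

def tricube(d, h):
    """(1 - (d/h)^3)^3 inside the bandwidth, 0 outside."""
    u = np.asarray(d, dtype=float) / h
    return np.where(u <= 1.0, (1.0 - u ** 3) ** 3, 0.0)

def gaussian(d, h):
    """exp(-0.5 (d/h)^2): one common unnormalised Gaussian form (assumed)."""
    return np.exp(-0.5 * (np.asarray(d, dtype=float) / h) ** 2)

def exponential(d, h):
    """exp(-d/h): one common exponential form (assumed)."""
    return np.exp(-np.asarray(d, dtype=float) / h)

def gcv_bandwidth(X, y, coords, h, kernel=bisquare):
    """GCV(h) = n * RSS / (n - tr(S))^2, with S mapping y to the local fitted values."""
    n = len(y)
    S = np.zeros((n, n))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)  # Euclidean distances to site i
        XtW = X.T * kernel(d, h)
        S[i] = X[i] @ np.linalg.pinv(XtW @ X) @ XtW      # i-th row of S
    resid = y - S @ y
    return n * float(resid @ resid) / (n - np.trace(S)) ** 2

# Hypothetical 5 x 5 grid of sites with a slope that drifts from west to east.
rng = np.random.default_rng(1)
gx, gy = np.meshgrid(np.arange(5.0), np.arange(5.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
x1 = rng.uniform(0.0, 1.0, 25)
X = np.column_stack([np.ones(25), x1])
y = 1.0 + (1.0 + 0.5 * coords[:, 0]) * x1 + rng.normal(0.0, 0.1, 25)

scores = {h: gcv_bandwidth(X, y, coords, h) for h in (2.5, 4.0, 8.0)}
best_h = min(scores, key=scores.get)  # bandwidth with the smallest GCV
```

Passing `kernel=tricube`, `kernel=gaussian`, or `kernel=exponential` to `gcv_bandwidth` allows the four weightings to be compared on the same GCV scale.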

Spatial heterogeneity test
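Differences in characteristics between observation points cause spatial heterogeneity. Spatial heterogeneity can be identified using the Breusch-Pagan test, with the following hypotheses [25]:

$$ H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_n^2 = \sigma^2 \quad \text{(homoscedasticity)} $$
$$ H_1: \text{at least one } \sigma_i^2 \neq \sigma^2 \quad \text{(heteroscedasticity)} $$

The test statistic is

$$ BP = \frac{1}{2}\, f' Z (Z'Z)^{-1} Z' f, $$

where $f$ is the vector with elements $f_i = e_i^2 / \hat{\sigma}^2 - 1$, $e_i$ is the least-squares residual of the $i$-th observation, and $Z$ is a matrix containing vectors that have been standardized for each observation. $H_0$ is rejected if $BP > \chi^2_{\alpha}(p)$, where $p$ is the number of predictors.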
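The statistic can be sketched in a few lines. This is an illustrative implementation of the standard Breusch-Pagan form under the assumption that $Z$ spans the same columns as the regression design (intercept plus predictors); the simulated heteroscedastic data are hypothetical.

```python
import numpy as np

def breusch_pagan(X, y):
    """BP = 0.5 * f' Z (Z'Z)^{-1} Z' f with f_i = e_i^2 / sigma^2 - 1."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta                      # least-squares residuals
    sigma2 = float(e @ e) / n
    f = e ** 2 / sigma2 - 1.0
    Z = X  # intercept plus predictors; standardising the columns leaves the projection unchanged
    gamma, *_ = np.linalg.lstsq(Z, f, rcond=None)
    return 0.5 * float(f @ (Z @ gamma))   # f' P_Z f / 2

# Hypothetical data whose error spread grows with the predictor.
rng = np.random.default_rng(42)
n = 300
x = rng.uniform(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n) * (0.1 + 3.0 * x)
bp_stat = breusch_pagan(X, y)
```

With one predictor the statistic is compared with the chi-square critical value on one degree of freedom (3.841 at the 5% level); a larger value leads to rejecting homoscedasticity.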

Model fit significance test
The model fit significance test determines whether the GWTSNR model is better than the global model. The following hypothesis is used [26] .
The test statistics used

Simultaneous parameter significance test
A simultaneous test determines whether the regression model parameters are jointly significant. The hypotheses of the simultaneous test are as follows [27].
The test statistics used: Where,

Partial parameter significance test
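Individual testing determines whether each parameter individually has a significant effect on the response variable, with the following hypotheses:

$$ H_0: \beta_{pk}(u_i, v_i) = 0 \quad \text{versus} \quad H_1: \beta_{pk}(u_i, v_i) \neq 0. $$

The test statistic is

$$ t = \frac{\hat{\beta}_{pk}(u_i, v_i)}{\sqrt{\widehat{\operatorname{var}}\bigl( \hat{\beta}_{pk}(u_i, v_i) \bigr)}}, $$

where $\widehat{\operatorname{var}}\bigl( \hat{\beta}_{pk}(u_i, v_i) \bigr)$ is the $(k+1)$-th diagonal element of the matrix $(Q' W(u_i, v_i) Q)^{-1} \hat{\sigma}^2(u_i, v_i)$. The test statistic for the GWTSNR model in Eq. 30 follows a $t$ distribution with $n - 1$ degrees of freedom at significance level $\alpha$. $H_0$ is rejected if $|t| > t_{(\alpha/2,\, n-1)}$ or if the p-value $< \alpha$, which means that the parameter has a significant effect on the model (Tables 2 and 5).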
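A minimal sketch of the partial test at one location is given below. The variance matrix follows the form stated above, $(Q'WQ)^{-1}\hat{\sigma}^2$; the estimator used for $\hat{\sigma}^2$ (weighted residual sum of squares divided by $n$ minus the number of parameters) is an assumption, as are the function name and the simulated data.

```python
import numpy as np

def local_t_statistics(Q, y, w):
    """Local estimates and t statistics from (Q'WQ)^{-1} sigma^2 at one location."""
    Q = np.asarray(Q, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    QtW = Q.T * w
    cov_unscaled = np.linalg.inv(QtW @ Q)         # (Q' W Q)^{-1}
    theta = cov_unscaled @ (QtW @ y)              # local parameter estimates
    resid = y - Q @ theta
    n, p = Q.shape
    sigma2 = float(w @ resid ** 2) / (n - p)      # assumed variance estimator
    se = np.sqrt(np.diag(cov_unscaled) * sigma2)  # square roots of the diagonal
    return theta, theta / se

# Hypothetical data: one predictor with a knot at 0.5 and a true slope of 3.
rng = np.random.default_rng(7)
n = 30
x1 = rng.uniform(0.0, 1.0, n)
Q = np.column_stack([np.ones(n), x1, np.clip(x1 - 0.5, 0.0, None)])
y = 1.0 + 3.0 * x1 + rng.normal(0.0, 0.1, n)
w = rng.uniform(0.2, 1.0, n)                      # stand-in for kernel weights
theta_hat, t_stats = local_t_statistics(Q, y, w)
```

Each absolute t statistic is then compared with the critical value for $n - 1$ degrees of freedom, which is 2.045 for $n = 30$ at $\alpha = 0.05$.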

The research steps
The steps of analysis in the research are as follows:
a. Describe the morbidity rate in North Sumatra and its predictors.
b. Make a scatterplot between the morbidity rate and each predictor to determine the relationship pattern.
c. Test for spatial heterogeneity using the Breusch-Pagan method.
d. Calculate the Euclidean distance between the $i$-th location $(u_i, v_i)$ and the $j$-th location $(u_j, v_j)$.
e. Determine the best weighting among the kernel functions, namely Gaussian, Bisquare, Tricube, and Exponential, based on the minimum GCV value.
f. Choose the optimum knot point based on the minimum GCV value.
g. Obtain the best GWTSNR model.
h. Test the model fit hypothesis between the GWTSNR model and the TSNR model.
i. Carry out the simultaneous and partial parameter significance tests.
j. Interpret the GWTSNR model.
k. Map the 33 districts/cities in North Sumatra based on the significant predictor variables.
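Step d can be illustrated with a short sketch using only the standard library. The coordinate pairs are hypothetical; the resulting distances are in the same units as the coordinates, which is also the unit in which the bandwidth is expressed.

```python
import math

def euclidean_distance_matrix(coords):
    """d_ij = sqrt((u_i - u_j)^2 + (v_i - v_j)^2) for every pair of locations."""
    n = len(coords)
    return [
        [math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
         for j in range(n)]
        for i in range(n)
    ]

# Hypothetical (u, v) coordinates for three locations.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = euclidean_distance_matrix(coords)
```

Row `D[i]` is the distance vector used to compute the kernel weights for the $i$-th location.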

Morbidity rate
Morbidity is a condition in which a person is considered sick because the health complaints experienced disrupt daily activities, that is, the person is unable to work, take care of the household, or carry out normal activities as usual. The morbidity rate is calculated as follows [28]:

$$ AM = \frac{JKPP}{JP} \times 100\%, $$

where
AM: morbidity rate
JKPP: the number of people who experience health complaints and disruption of activities
JP: total population

The morbidity rate in an area is affected by several factors. The determinants of morbidity are social, economic, and cultural factors [29]. Wulandari (2017) found that population density, the average length of schooling, poverty percentage, regional minimum wage, percentage of open defecation households, and percentage of households with a distance of more than 10 meters from the drinking water source to sewage storage significantly affected the morbidity rate in East Java [30]. Hanum (2013), using the Multivariate Geographically Weighted Regression model, found that life expectancy, illiteracy rate, percentage of the population with protected water sources from wells, percentage of the population seeking outpatient treatment from health workers, percentage of the population with a distance of more than 10 meters from the drinking water source to sewage storage, and percentage of the population with a monthly per capita expenditure of 200,000 to 299,999 on nutritious food significantly affected morbidity rates [31]. According to Gordon (1954), the morbidity rate is influenced by environmental factors consisting of the biological environment, the physical environment, the socio-economic environment, maternal education level, and health services [32].
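Based on the description above, this research uses several predictor variables that are thought to affect the morbidity rate in North Sumatra in 2020. The variables are as follows:
Y: morbidity rate
X1: poverty percentage
X2: percentage of households with access to proper sanitation
X3: population density
X4: open unemployment rate
X5: number of general hospitals
X6: percentage of households with access to adequate drinking water
X7: average length of schooling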

Characteristics of morbidity rates in North Sumatra
North Sumatra is the second-largest province on Sumatra Island. In 2020 its population reached 14,799,361 people living in an area of 72,981.23 km², giving an average population density of 202.78 people per square kilometer. In 2020 the morbidity rate in North Sumatra reached 12.24, which means that about 12 out of every 100 residents experienced illness complaints (Figs. 1-3).

The 2020 Performance Report of the Government Agencies of the Health Office of North Sumatra Province describes a situation similar to national conditions. The morbidity rate in North Sumatra was 11.84% in 2015, 11.15% in 2016, 11.17% in 2017, and 11.03% in 2018; it then increased to 11.97% in 2019 and increased again to 12.24% in 2020. Over the six years from 2015 to 2020, the morbidity rates in North Sumatra Province were consistently below the national figure.

The mean, variance, minimum, and maximum values were calculated for the response variable and the seven predictor variables that are thought to affect it. The results of the descriptive statistics are presented in Table 1, and the data are also mapped by district/city.

Fig. 2. Description information based on variable data mapping. The lowest Y is in Humbang Hasundutan Regency, and the highest Y is in Batubara Regency. The lowest X1 is in Deli Serdang Regency, and the highest X1 is in West Nias Regency. The lowest X2 is in South Nias Regency, and the highest X2 is in Binjai City. The lowest X3 is in Pakpak Bharat Regency, and the highest X3 is in Medan City. The lowest X4 is in Humbang Hasundutan Regency, and the highest X4 is in Gunungsitoli City. The lowest X5 is in West Nias Regency, and the highest X5 is in Medan City. The lowest X6 is in Padang Sidimpuan Regency, and the highest X6 is in Pematangsiantar City. The lowest X7 is in Nias, and the highest X7 is in Medan City.

In the multicollinearity test, the VIF value is below 10 for all independent variables, so there is no multicollinearity between the predictor variables used in this study and they can be used to form a regression model.

Table 3 shows that $BP > \chi^2_{\alpha}(p) = \chi^2_{0.05}(7) = 14.0671404$, or p-value $= 0.00166 < 0.05$, so $H_0$ is rejected. In other words, the variance differs between locations (heterogeneous), or there are differences in characteristics between one observation point and another.

Table 4 shows that, given the spatial heterogeneity, the optimum bandwidth is 1.499684794 using the Bisquare kernel weighting function, based on the minimum GCV value. Based on the minimum GCV value, the best model is the GWTSNR model with one knot point, with a GCV value of 8.227800831.

Data patterns between morbidity rates and predictors
Next, scatterplots of the morbidity rate against each factor thought to influence it are presented to show the pattern of the relationship between the dependent variable and each independent variable. If a plot forms a recognizable pattern, parametric regression is suitable; if it does not follow any particular pattern, nonparametric regression is appropriate.

Multicollinearity test
Ragnar Frisch first coined the term multicollinearity. The multicollinearity test is a requirement for all causality (regression) hypothesis tests. Multicollinearity is detected if $VIF_p(u_i, v_i) > 10$, where

$$ VIF_p(u_i, v_i) = \frac{1}{1 - R_p^2(u_i, v_i)} $$

and $R_p^2(u_i, v_i)$ is the coefficient of determination of the $p$-th variable at the $i$-th location. The VIF values of the seven independent variables used in this study are as follows:
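The VIF calculation can be sketched as follows. For simplicity this sketch computes a global (unweighted) VIF for each predictor, which is an assumption that departs from the location-specific $R_p^2(u_i, v_i)$ above; a local version would use weighted regressions at each location. The three simulated predictors are hypothetical.

```python
import numpy as np

def vif(X):
    """VIF_p = 1 / (1 - R_p^2), regressing column p on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for p in range(k):
        target = X[:, p]
        others = np.column_stack([np.ones(n), np.delete(X, p, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        total = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - float(resid @ resid) / total
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Hypothetical predictors: the third is almost a copy of the first.
rng = np.random.default_rng(2)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = a + 0.1 * rng.normal(size=100)
vifs = vif(np.column_stack([a, b, c]))
```

In this toy example the first and third predictors exceed the threshold of 10, while the independent second predictor stays close to 1.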

Spatial heterogeneity test and the best weights matrix
The existence of differences in characteristics between location points causes spatial heterogeneity, so spatial weighting is needed. The best spatial weighting is obtained from the bandwidth value, which has the minimum Generalized Cross Validation (GCV) value. The following are the results of spatial heterogeneity testing and the selection of the best bandwidth.

Best knot point selection
The next step in determining the best model is determining the knot point. The knot point is the point where the data pattern changes. The following table shows the GCV value at each knot point.

Parameter estimation of morbidity rate model in North Sumatra in 2020
Based on the selected optimum knot point, the parameter estimates of the GWTSNR model with one knot point are as follows.
The GWTSNR model is written out for the 30th location, Medan City, as an example. Thus, it can be concluded that there is a significant difference between the GWTSNR model and the TSNR model.

Simultaneous parameter significance test
Simultaneous testing is carried out to test the model parameter estimates jointly. The following are the results of the ANOVA simultaneous parameter test.
Thus it can be concluded that at least one parameter in the GWTSNR model is significant for the response variable. In other words, the poverty percentage, the percentage of households with access to proper sanitation, population density, the open unemployment rate, the number of general hospitals, the percentage of households with access to adequate drinking water, and the average length of schooling jointly affect the morbidity rate in North Sumatra in 2020.

Partial parameter significance test
The results of the partial parameter significance test show that the influential predictor variables differ between areas. This results in eight groups of districts/cities based on the influential predictor variables. The grouping of districts/cities based on the variables significant to the 2020 morbidity rate in North Sumatra is given as follows.
6. The morbidity rate in Sibolga City is affected by X4 and X5.
7. The morbidity rate in Asahan is affected by X1 and X5.
8. The morbidity rate in Tapanuli Tengah is affected by X5.
The mapping of Morbidity Rates can be presented in Fig. 4 .

Model interpretation
After the partial significance test with 13 regional groups for significant parameters, the best model, namely the GWTSNR model with one knot point, is interpreted for the 30th location, Medan City (Eq. 34).
The interpretation of the above model is explained as follows.
a. Assuming the predictors (X2, X3, X4, X5, X6, X7) are constant, the effect of the poverty percentage (X1) on the morbidity rate in 2020 (y) in Medan City is interpreted from the fitted model as follows. If the poverty percentage is less than 4.3162%, every 1% increase in the poverty percentage reduces the morbidity rate by 19.077%. If the poverty percentage is greater than or equal to 4.3162%, every 1% increase in the poverty percentage raises the morbidity rate by 0.155%. Because the poverty percentage in Medan City in 2020 is 8.01, an increase in the poverty percentage will also increase the morbidity rate.
b. Assuming the predictors (X1, X3, X4, X5, X6, X7) are constant, the effect of the percentage of households with access to proper sanitation (X2) on the morbidity rate in 2020 (y) in Medan City is interpreted as follows. If the percentage of households with access to proper sanitation is less than 13.1744%, every 1% increase raises the morbidity rate by 0.450%. If the percentage is greater than or equal to 13.1744%, every 1% increase reduces the morbidity rate by 0.302%. Because the percentage of households with access to proper sanitation in Medan City in 2020 is 93.16, an increase in this percentage will decrease the morbidity rate.
c. Assuming the predictors (X1, X2, X4, X5, X6, X7) are constant, the effect of population density (X3) on the morbidity rate in 2020 (y) in Medan City is interpreted as follows. If the population density is less than 225.903, every 1-unit increase reduces the morbidity rate by 0.0075%. If the population density is greater than or equal to 225.903, every 1-unit increase raises the morbidity rate by 0.0016%. Because the population density in Medan City in 2020 is 9189.63, an increase in population density will also increase the morbidity rate.
e. For the number of public hospitals (X5), with the other predictors held constant, the fitted model is interpreted as follows. If the number of public hospitals is less than 1.34, every 1-unit increase raises the morbidity rate by 8.749%. If the number of public hospitals is greater than or equal to 1.34, every 1-unit increase reduces the morbidity rate by 0.119%. Because the number of public hospitals in Medan City in 2020 is 67, an increase in the number of public hospitals will decrease the morbidity rate.
f. Assuming the predictors (X1, X2, X3, X4, X5, X7) are constant, the effect of the percentage of households with access to safe drinking water (X6) on the morbidity rate in 2020 (y) in Medan City is interpreted as follows. If the percentage of households with access to safe drinking water is less than 44.4184%, every 1% increase raises the morbidity rate by 2.092%. If the percentage is greater than or equal to 44.4184%, every 1% increase raises the morbidity rate by 0.128%. Because the percentage of households with access to safe drinking water in Medan City in 2020 is 98.79, an increase in this percentage will increase the morbidity rate.
g. Assuming the predictors (X1, X2, X3, X4, X5, X6) are constant, the effect of the average length of schooling (X7) on the morbidity rate in 2020 (y) in Medan City is interpreted as follows. If the average length of schooling is less than 5.4806 years, every one-year increase reduces the morbidity rate by 1.853%. If the average length of schooling is greater than or equal to 5.4806 years, every one-year increase reduces the morbidity rate by 4.354%. Because the average length of schooling in Medan City in 2020 is 11.39, an increase in the average length of schooling will decrease the morbidity rate.
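The link between the reported slopes and the truncated-spline coefficients can be made explicit with a small worked example. It assumes a linear one-knot spline, which is what constant slopes on each side of the knot imply: below the knot the slope is the polynomial coefficient, and at or above the knot it is the polynomial coefficient plus the truncated coefficient. The function names are hypothetical; the numbers are the poverty-percentage values quoted in item a.

```python
def truncated_coefficients(slope_below, slope_above):
    """Linear one-knot spline: slope_below = beta, slope_above = beta + delta."""
    beta = slope_below
    delta = slope_above - slope_below
    return beta, delta

def local_slope(x, knot, beta, delta):
    """Slope of beta * x + delta * (x - knot)_+ at the value x."""
    return beta + (delta if x >= knot else 0.0)

# Poverty percentage (item a): slope -19.077 below the knot 4.3162, +0.155 above it.
beta1, delta1 = truncated_coefficients(-19.077, 0.155)

# Medan City's 2020 poverty percentage of 8.01 lies above the knot.
slope_medan = local_slope(8.01, 4.3162, beta1, delta1)
```

Under this assumption the truncated coefficient implied for the poverty percentage is about 19.232, and the slope that applies to Medan City is the positive one, which matches the direction of the interpretation in item a.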

Model comparison
The goodness of fit of the GWTSNR model is compared with that of the TSNR model, which serves as the global regression, using the coefficient of determination. The GWTSNR model with one knot point has an adjusted R-squared of 96.235%, which is greater than the adjusted R-squared of 70.159% for the TSNR model with one knot point. This indicates that the GWTSNR model with one knot point is the best model: the predictor variables X1, X2, X3, X4, X5, X6, and X7 explain 96.235% of the variation in the morbidity rate.

Conclusion
For the 2020 morbidity rate data in North Sumatra, the regression curves between the predictor variables and the response variable do not follow a particular pattern, and the morbidity rate also shows a spatial effect. There are eight regional groupings based on significant predictors, with different effects in each group. The GWTSNR model of the morbidity rate with one knot point has an adjusted R-squared of 96.235%, which is greater than the adjusted R-squared of 70.159% for the TSNR model with one knot point. This indicates that the GWTSNR model with one knot point is the best model: the predictor variables X1, X2, X3, X4, X5, X6, and X7 explain 96.235% of the variation in the morbidity rate.

Ethics statements
The data used in this study are the morbidity rate data for North Sumatra in 2020. They are secondary data accessed from the official website of the North Sumatra Provincial Health Office (https://bit.ly/3a3g2Xw), in the publication North Sumatra Provincial Health Office Performance Report in 2020. The predictor variables used in this study are secondary data accessed from the website of the Central Statistics Agency of North Sumatra (https://sumut.bps.go.id/) and are also available in the BPS North Sumatra publication North Sumatra Province in Figures 2021.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Data will be made available on request.